Rationale¶
For some certain loss functions, such the the negative evidence lower bound (NELBO) in variational inference, they are generally analytically intractable and thus unavailable in closed-form. As such, we might need to resort to taking stochastic estimates of the loss function. In these situations, it is very important to study and understand the robustness of the estimations we are making, particularly in terms of bias and variance. When proposing a new estimator, we may be interested in evaluating the loss at a fined-grained level - not only per batch, but perhaps even per data-point.
This notebook explores storing the recorded losses in Pandas Dataframes. The recorded losses are 3d, with dimensions corresponding to epochs, batches, and data-points. Specifically, they are of shape (n_epochs, n_batches, batch_size). Instead of using the deprecated Panel functionality from Pandas, we explore the preferred MultiIndex Dataframe.
Lastly, we play around with various data serialization formats supported out-of-the-box by Pandas. This might be useful if the training is GPU-intensive, so the script runs and records the loss remotely on a supercomputer, and we must write the results to file, download them and finally analyze them locally. This is usually trivial, but it is unclear what the behaviour is for more complex MultiIndex dataframes. We restrict our attention to the CSV format, which is human-friendly but very slow and inefficient, and the HDF5, which is basically diametrically opposed - it's basically completely inscrutable, but is very fast and takes up laess space.